EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices
In recent years, advances in deep learning have resulted in unprecedented
leaps in diverse tasks spanning from speech and object recognition to context
awareness and health monitoring. As a result, an increasing number of
AI-enabled applications are being developed targeting ubiquitous and mobile
devices. While deep neural networks (DNNs) are getting bigger and more complex,
they also impose a heavy computational and energy burden on the host devices,
which has led to the integration of various specialized processors in commodity
devices. Given the broad range of competing DNN architectures and the
heterogeneity of the target hardware, there is an emerging need to understand
the compatibility between DNN-platform pairs and the expected performance
benefits on each platform. This work attempts to demystify this landscape by
systematically evaluating a collection of state-of-the-art DNNs on a wide
variety of commodity devices. In this respect, we identify potential
bottlenecks in each architecture and provide important guidelines that can
assist the community in the co-design of more efficient DNNs and accelerators.
Comment: Accepted at MobiSys 2019: 3rd International Workshop on Embedded and Mobile Deep Learning (EMDL), 2019
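The per-device evaluation described above boils down to repeatedly timing inference and reporting a robust statistic. A minimal sketch of such a timing harness follows; `run_inference` and `sample` are hypothetical stand-ins for a compiled model and an input on the device under test, not EmBench's actual code:

```python
import time

def median_latency_ms(run_inference, sample, warmup=3, runs=10):
    """Median single-sample inference latency in milliseconds."""
    for _ in range(warmup):
        # Warm-up iterations discard one-off costs (JIT compilation, caches).
        run_inference(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference(sample)
        timings.append((time.perf_counter() - start) * 1e3)
    timings.sort()
    # Median is less sensitive to scheduler jitter than the mean.
    return timings[len(timings) // 2]
```

Repeating this per DNN-platform pair is what exposes the architecture-specific bottlenecks the abstract refers to.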
Maestro: Uncovering Low-Rank Structures via Trainable Decomposition
Deep Neural Networks (DNNs) have been a large driver and enabler for AI
breakthroughs in recent years. These models have been getting larger in their
attempt to become more accurate and tackle new upcoming use-cases, including
AR/VR and intelligent assistants. However, the training process of such large
models is a costly and time-consuming process, which typically yields a single
model to fit all targets. To mitigate this, various techniques have been
proposed in the literature, including pruning, sparsification or quantization
of the model weights and updates. While able to achieve high compression rates,
they often incur computational overheads or accuracy penalties. Alternatively,
factorization methods have been leveraged to incorporate low-rank compression
in the training process. Similarly, such techniques (e.g.,~SVD) frequently rely
on the computationally expensive decomposition of layers and are potentially
sub-optimal for non-linear models, such as DNNs. In this work, we take a
further step in designing efficient low-rank models and propose Maestro, a
framework for trainable low-rank layers. Instead of regularly applying a priori
decompositions such as SVD, the low-rank structure is built into the training
process through a generalized variant of Ordered Dropout. This method imposes
an importance ordering via sampling on the decomposed DNN structure. Our
theoretical analysis demonstrates that our method recovers the SVD
decomposition of linear mapping on uniformly distributed data and PCA for
linear autoencoders. We further apply our technique on DNNs and empirically
illustrate that Maestro enables the extraction of lower footprint models that
preserve model performance while allowing for graceful accuracy-latency
tradeoff for deployment to devices of different capabilities.
Comment: Under review
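The core idea of a factorised layer with sampled rank cutoffs can be sketched as follows. This is a hypothetical illustration of ordered-dropout-style rank sampling on a low-rank linear map, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class LowRankLinear:
    """W is kept factorised as U @ V (rank r) throughout training."""
    def __init__(self, d_in, d_out, rank):
        self.U = rng.normal(scale=0.1, size=(d_out, rank))
        self.V = rng.normal(scale=0.1, size=(rank, d_in))
        self.rank = rank

    def forward(self, x, k=None):
        # During training, a cutoff k <= rank is sampled per step; because
        # the first components are used in every sample, they are pushed to
        # capture the most important directions (the importance ordering).
        if k is None:
            k = int(rng.integers(1, self.rank + 1))
        return self.U[:, :k] @ (self.V[:k, :] @ x)
```

At deployment, truncating to the first k components yields a smaller model directly, which is the graceful accuracy-latency trade-off the abstract describes.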
Multi-Exit Semantic Segmentation Networks
Semantic segmentation arises as the backbone of many vision systems, spanning
from self-driving cars and robot navigation to augmented reality and
teleconferencing. Frequently operating under stringent latency constraints
within a limited resource envelope, optimising for efficient execution becomes
important. At the same time, the heterogeneous capabilities of the target
platforms and the diverse constraints of different applications require the
design and training of multiple target-specific segmentation models, leading to
excessive maintenance costs. To this end, we propose a framework for converting
state-of-the-art segmentation CNNs to Multi-Exit Semantic Segmentation (MESS)
networks: specially trained models that employ parametrised early exits along
their depth to i) dynamically save computation during inference on easier
samples and ii) save training and maintenance cost by offering a post-training
customisable speed-accuracy trade-off. Designing and training such networks
naively can hurt performance. Thus, we propose a novel two-staged training
scheme for multi-exit networks. Furthermore, the parametrisation of MESS
enables co-optimising the number, placement and architecture of the attached
segmentation heads along with the exit policy, upon deployment via exhaustive
search in <1 GPUh. This allows MESS to rapidly adapt to the device capabilities
and application requirements for each target use-case, offering a
train-once-deploy-everywhere solution. MESS variants achieve latency gains of
up to 2.83x with the same accuracy, or 5.33 pp higher accuracy for the same
computational budget, compared to the original backbone network. Lastly, MESS
delivers orders of magnitude faster architectural customisation, compared to
state-of-the-art techniques.
Comment: (Extended version) Accepted at ECCV 2022
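The inference-time behaviour of an early-exit network, exiting as soon as an attached head is confident enough, can be sketched as below. The stage and head functions and the confidence rule are hypothetical simplifications of MESS's parametrised exits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_inference(stages, exit_heads, x, threshold=0.9):
    """Run backbone stages in order; stop at the first exit head whose
    top-class probability clears the confidence threshold."""
    probs = None
    for stage, head in zip(stages, exit_heads):
        x = stage(x)                      # next chunk of the backbone
        probs = softmax(head(x))          # cheap exit head on the features
        if probs.max() >= threshold:
            break                         # easy sample: save the remaining compute
    return int(probs.argmax()), float(probs.max())
```

Tuning the threshold (and which exits are attached) after training is what gives the post-training customisable speed-accuracy trade-off.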
HAPI: Hardware-Aware Progressive Inference
Convolutional neural networks (CNNs) have recently become the
state-of-the-art in a diversity of AI tasks. Despite their popularity, CNN
inference still comes at a high computational cost. A growing body of work aims
to alleviate this by exploiting the difference in the classification difficulty
among samples and early-exiting at different stages of the network.
Nevertheless, existing studies on early exiting have primarily focused on the
training scheme, without considering the use-case requirements or the
deployment platform. This work presents HAPI, a novel methodology for
generating high-performance early-exit networks by co-optimising the placement
of intermediate exits together with the early-exit strategy at inference time.
Furthermore, we propose an efficient design space exploration algorithm which
enables the faster traversal of a large number of alternative architectures and
generates the highest-performing design, tailored to the use-case requirements
and target hardware. Quantitative evaluation shows that our system consistently
outperforms alternative search mechanisms and state-of-the-art early-exit
schemes across various latency budgets. Moreover, it pushes further the
performance of highly optimised hand-crafted early-exit CNNs, delivering up to
5.11x speedup over lightweight models on imposed latency-driven SLAs for
embedded devices.
Comment: Accepted at the 39th International Conference on Computer-Aided Design (ICCAD), 2020
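The design space exploration in this setting amounts to scoring candidate exit placements against a latency budget. A toy exhaustive-search sketch follows; the latency and accuracy oracles are hypothetical stand-ins for the hardware-aware cost models, not HAPI's actual algorithm:

```python
from itertools import combinations

def search_exit_config(candidates, latency_of, accuracy_of, budget):
    """Enumerate placements of exits at candidate positions and keep the
    most accurate design whose estimated latency meets the budget."""
    best = None  # (placement, accuracy, latency)
    for r in range(1, len(candidates) + 1):
        for placement in combinations(candidates, r):
            lat = latency_of(placement)
            if lat > budget:
                continue  # violates the use-case SLA; discard
            acc = accuracy_of(placement)
            if best is None or acc > best[1]:
                best = (placement, acc, lat)
    return best
```

In practice the search is made efficient rather than brute-force, but the objective, best accuracy under a per-platform latency constraint, is the same.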
SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud
Despite the soaring use of convolutional neural networks (CNNs) in mobile
applications, uniformly sustaining high-performance inference on mobile has
been elusive due to the excessive computational demands of modern CNNs and the
increasing diversity of deployed devices. A popular alternative comprises
offloading CNN processing to powerful cloud-based servers. Nevertheless, by
relying on the cloud to produce outputs, emerging mission-critical and
high-mobility applications, such as drone obstacle avoidance or interactive
applications, can suffer from the dynamic connectivity conditions and the
uncertain availability of the cloud. In this paper, we propose SPINN, a
distributed inference system that employs synergistic device-cloud computation
together with a progressive inference method to deliver fast and robust CNN
inference across diverse settings. The proposed system introduces a novel
scheduler that co-optimises the early-exit policy and the CNN splitting at run
time, in order to adapt to dynamic conditions and meet user-defined
service-level requirements. Quantitative evaluation illustrates that SPINN
outperforms its state-of-the-art collaborative inference counterparts by up to
2x in achieved throughput under varying network conditions, reduces the server
cost by up to 6.8x and improves accuracy by 20.7% under latency constraints,
while providing robust operation under uncertain connectivity conditions and
significant energy savings compared to cloud-centric execution.
Comment: Accepted at the 26th Annual International Conference on Mobile Computing and Networking (MobiCom), 2020
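The split-point decision at the heart of such device-cloud schedulers can be sketched as minimising on-device compute time plus activation transfer time plus cloud compute time. All rates and byte counts below are hypothetical stand-ins, not SPINN's cost model:

```python
def best_split(flops, out_bytes, input_bytes, dev_rate, cloud_rate, bandwidth):
    """Pick the layer after which to offload: layers [0, s) run on-device,
    the activation is shipped, and layers [s, n) run in the cloud."""
    n = len(flops)
    best_t, best_s = float("inf"), 0
    for s in range(n + 1):  # s == 0: all-cloud; s == n: all-device
        dev_t = sum(flops[:s]) / dev_rate
        xfer_bytes = out_bytes[s - 1] if s > 0 else input_bytes
        xfer_t = 0.0 if s == n else xfer_bytes / bandwidth  # all-device: no upload
        cloud_t = sum(flops[s:]) / cloud_rate
        total = dev_t + xfer_t + cloud_t
        if total < best_t:
            best_t, best_s = total, s
    return best_s, best_t
```

Re-running this decision as bandwidth fluctuates, jointly with the early-exit threshold, is what lets the scheduler adapt at run time.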
H<sub>2</sub>O<sub>2</sub>-Enhanced As(III) Removal from Natural Waters by Fe(III) Coagulation at Neutral pH Values and Comparison with the Conventional Fe(II)-H<sub>2</sub>O<sub>2</sub> Fenton Process
Arsenic is a naturally occurring contaminant in waters, which is toxic and adversely affects human health. Therefore, treatment of water for arsenic removal is very important for the production of safe drinking water. Coagulation using Fe(III) salts is the most frequently applied technology for arsenic removal, but it is efficient mostly for As(V) removal. As(III) removal usually requires the application of a pre-oxidation step, which is mainly conducted by chemical or biological means. In this study, we show that Fe(III) coagulation in the presence of H2O2 can be a very efficient treatment process for As(III) removal, which has never been shown before in the literature. The results showed that the addition of 8.7–43.7 mM hydrogen peroxide to the Fe(III) coagulation process increased the effectiveness of As(III) removal in synthetic groundwater by 15–20%, providing residual concentrations well below the regulatory limit of 10 μg/L from initial As(III) concentrations of 100 μg/L, at pH 7. The enhanced coagulation process was affected by the solution pH: the removal efficiency substantially declined at alkaline pH values (pH > 8). Addition of EDTA in the absence of H2O2 had a strong inhibiting effect, with As(III) removal dropping to almost zero when 88.38 μM EDTA was used. Radical quenching experiments with 50, 100 and 200 mM DMSO, methanol and 2-propanol in the H2O2-coagulation process had only a slightly adverse effect on the removal efficiency. This is considered indicative of an adsorption/oxidation process of As(III) occurring on or very near the surface of the iron oxide particles formed by the hydrolysis of ferric ions. In practice, the results suggest that the addition of H2O2 increases the As(III) removal efficiency of Fe(III) coagulation systems. This is an important finding because the pre-oxidation step can be omitted with the addition of H2O2 while treating water contaminated with As(III).
Tape SCSI monitoring and encryption at CERN
CERN currently manages the largest data archive in the HEP domain; over 180 PB of custodial data is archived across 7 enterprise tape libraries containing more than 25,000 tapes and using over 100 tape drives. Archival storage at this scale requires a leading-edge monitoring infrastructure that acquires live and lifelong metrics from the hardware in order to assess and proactively identify potential drive- and media-level issues. In addition, protecting the privacy of sensitive archival data is becoming increasingly important, and with it the need for a scalable, compute-efficient and cost-effective solution for data encryption. In this paper, we first describe the implementation of acquiring tape medium and drive-related metrics reported by the SCSI interface and its integration with our monitoring system. We then address the incorporation of tape drive real-time encryption with dedicated drive hardware into the CASTOR [1] hierarchical mass storage system.